The data used is Population Growth, Fertility and Mortality Indicators.csv, which reports a set of population growth, fertility, and mortality indicators for each country in the world.
The data contains the following variables:
T03 The country code
Population.growth.and.indicators.of.fertility.and.mortality The country list
X The year column
X.1 The indicator name. This column will be spread into one column per indicator.
X.2 The values of the observations.
X.3 Footnotes
X.4 Data source
Assume that we are going to classify countries listed based on the indicators contained in the data.
Libraries Importing and Data Preparation.
Exploratory Data Analysis
PCA Transformation.
Biplotting and Interpretation.
## 'data.frame': 4979 obs. of 7 variables:
## $ T03 : chr "Region/Country/Area" "1" "1" "1" ...
## $ Population.growth.and.indicators.of.fertility.and.mortality: chr "" "Total, all countries or areas" "Total, all countries or areas" "Total, all countries or areas" ...
## $ X : chr "Year" "2005" "2005" "2005" ...
## $ X.1 : chr "Series" "Population annual rate of increase (percent)" "Total fertility rate (children per women)" "Infant mortality for both sexes (per 1,000 live births)" ...
## $ X.2 : chr "Value" "1.3" "2.6" "49.1" ...
## $ X.3 : chr "Footnotes" "Data refers to a 5-year period preceding the reference year." "Data refers to a 5-year period preceding the reference year." "Data refers to a 5-year period preceding the reference year." ...
## $ X.4 : chr "Source" "United Nations Population Division, New York, World Population Prospects: The 2017 Revision, last accessed June 2017." "United Nations Population Division, New York, World Population Prospects: The 2017 Revision; supplemented by da"| __truncated__ "United Nations Statistics Division, New York, \"Demographic Yearbook 2015\" and the demographic statistics data"| __truncated__ ...
We only need some of the variables to process the data, so the first column and the last two columns will be dropped.
The year column ranges from 2000 to 2016, but most countries only have values for 2005, 2010, and 2015.
The X.1 column contains seven indicators; we are going to spread each of them into its own column.
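The spreading step described above can be sketched in base R. This is a minimal illustration on a made-up long table, not the code used to build the table in this document; the data frame `long` and its column names `series` and `value` are assumptions introduced for the example.

```r
# Toy long-format table: one row per (country, year, indicator).
# All values and names here are made up for illustration.
long <- data.frame(
  Country = c("A", "A", "B", "B"),
  year    = c(2015, 2015, 2015, 2015),
  series  = c("pop.increase", "tot.fertil.rate",
              "pop.increase", "tot.fertil.rate"),
  value   = c("1.2", "2.5", "-0.6", "1.5"),
  stringsAsFactors = FALSE
)

# The raw values are read as character, so convert them first.
long$value <- as.numeric(long$value)

# Spread the indicator column into one column per indicator.
wide <- reshape(long,
                idvar     = c("Country", "year"),
                timevar   = "series",
                direction = "wide")

# reshape() prefixes the new columns with "value."; strip that prefix.
names(wide) <- sub("^value\\.", "", names(wide))

print(wide)
```

The same result can be obtained with a dedicated reshaping package; base `reshape()` is used here only to keep the sketch free of dependencies.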
| Code | Country | year | inf.mort | life.exp.both | life.exp.female | life.exp.male | maternal.mortality.ratio | pop.increase | tot.fertil.rate |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Total, all countries or areas | 2015 | 35.0 | 70.8 | 73.1 | 68.6 | 216 | 1.2 | 2.5 |
| 100 | Bulgaria | 2015 | 8.3 | 74.3 | 77.8 | 70.8 | 11 | -0.6 | 1.5 |
| 104 | Myanmar | 2015 | 45.0 | 66.0 | 68.3 | 63.7 | 178 | 0.9 | 2.3 |
| 108 | Burundi | 2015 | 77.9 | 56.1 | 58.0 | 54.2 | 712 | 3.0 | 6.0 |
| 11 | Western Africa | 2015 | 70.5 | 54.7 | 55.6 | 53.9 | NA | 2.7 | 5.5 |
| 112 | Belarus | 2015 | 3.6 | 72.1 | 77.7 | 66.5 | 4 | 0.0 | 1.6 |
Country = Country list ;
inf.mort = Infant mortality for both sexes (per 1,000 live births) ;
life.exp.both = Life expectancy at birth for both sexes (years) ;
life.exp.male = Life expectancy at birth for males (years) ;
life.exp.female = Life expectancy at birth for females (years) ;
maternal.mortality.ratio = Maternal mortality ratio (deaths per 100,000 population) ;
pop.increase = Population annual rate of increase (percent) ;
tot.fertil.rate = Total fertility rate (children per women)
| column | NA |
|---|---|
| Code | 0 |
| Country | 0 |
| year | 0 |
| inf.mort | 31 |
| life.exp.both | 31 |
| life.exp.female | 29 |
| life.exp.male | 29 |
| maternal.mortality.ratio | 73 |
| pop.increase | 0 |
| tot.fertil.rate | 29 |
There are many NAs in the data, which means that not every listed country has all the data we need.
| Code | Country | year | inf.mort | life.exp.both | life.exp.female | life.exp.male | maternal.mortality.ratio | pop.increase | tot.fertil.rate |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Total, all countries or areas | 2015 | 35.0 | 70.8 | 73.1 | 68.6 | 216.0000 | 1.2 | 2.5 |
| 100 | Bulgaria | 2015 | 8.3 | 74.3 | 77.8 | 70.8 | 11.0000 | -0.6 | 1.5 |
| 104 | Myanmar | 2015 | 45.0 | 66.0 | 68.3 | 63.7 | 178.0000 | 0.9 | 2.3 |
| 108 | Burundi | 2015 | 77.9 | 56.1 | 58.0 | 54.2 | 712.0000 | 3.0 | 6.0 |
| 11 | Western Africa | 2015 | 70.5 | 54.7 | 55.6 | 53.9 | 162.1842 | 2.7 | 5.5 |
| 112 | Belarus | 2015 | 3.6 | 72.1 | 77.7 | 66.5 | 4.0000 | 0.0 | 1.6 |
Something odd happens when we replace the NAs with the column averages. Some rows (countries) have no observed values at all, or only one or two observed indicators, and we have filled them almost entirely with averages. That is not what we want, so those rows should be removed.
I create a vector that flags whether most of a row's values are column averages. If they are, the row is removed.
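A minimal sketch of the imputation and filtering described above, on a made-up data frame. The threshold of "more than half of the indicators missing" is an assumption chosen for illustration; the document does not state the exact rule.

```r
# Toy data: country C has no observed indicators at all.
df <- data.frame(
  Country  = c("A", "B", "C", "D"),
  inf.mort = c(8.3, 45.0, NA, 77.9),
  life.exp = c(74.3, 66.0, NA, 56.1),
  mat.mort = c(11, NA, NA, 712),
  stringsAsFactors = FALSE
)
num_cols <- setdiff(names(df), "Country")

# Count missing values per column and per row before imputing.
na_per_col <- colSums(is.na(df[num_cols]))
na_per_row <- rowSums(is.na(df[num_cols]))

# Replace each NA with the mean of its column.
for (col in num_cols) {
  col_mean <- mean(df[[col]], na.rm = TRUE)
  df[[col]][is.na(df[[col]])] <- col_mean
}

# Flag rows that were mostly imputed (assumed rule: more than half missing)
# and drop them.
mostly_imputed <- na_per_row > length(num_cols) / 2
df_clean <- df[!mostly_imputed, ]

print(na_per_col)
print(df_clean)
```

Recording the NA counts before imputing avoids having to compare each value against the column mean afterwards.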
## 'data.frame': 235 obs. of 8 variables:
## $ Country : chr "Total, all countries or areas" "Bulgaria" "Myanmar" "Burundi" ...
## $ inf.mort : num 35 8.3 45 77.9 70.5 3.6 29.9 27.7 67.5 4.7 ...
## $ life.exp.both : num 70.8 74.3 66 56.1 54.7 72.1 67.6 75.3 56.4 81.8 ...
## $ life.exp.female : num 73.1 77.8 68.3 58 55.6 77.7 69.6 76.5 57.7 83.8 ...
## $ life.exp.male : num 68.6 70.8 63.7 54.2 53.9 66.5 65.5 74.1 55.1 79.7 ...
## $ maternal.mortality.ratio: num 216 11 178 712 162 ...
## $ pop.increase : num 1.2 -0.6 0.9 3 2.7 0 1.6 2 2.7 1 ...
## $ tot.fertil.rate : num 2.5 1.5 2.3 6 5.5 1.6 2.7 3 5 1.6 ...
Adding a Continent column will give us some more insight, so let's do it.
| Country | Continent |
|---|---|
| Total, all countries or areas | NA |
| Bulgaria | Europe |
| Myanmar | Asia |
| Burundi | Africa |
| Western Africa | NA |
| Belarus | Europe |
Some rows cannot be assigned a continent, and none of them is actually a country. They are regions or aggregate areas.
Our observations are countries, so we will remove the rows that represent regions or areas.
| Country | Continent |
|---|---|
| Total, all countries or areas | ? |
| Western Africa | ? |
| Central America | ? |
| Eastern Africa | ? |
| Asia | ? |
| Central Asia | ? |
| Western Asia | ? |
| Northern Africa | ? |
| Europe | ? |
| Eastern Europe | ? |
| Northern Europe | ? |
| Western Europe | ? |
| Other non-specified areas | ? |
| Middle Africa | ? |
| Southern Africa | ? |
| Africa | ? |
| Sub-Saharan Africa | ? |
| Northern America | ? |
| Caribbean | ? |
| Eastern Asia | ? |
| Southern Asia | ? |
| South-eastern Asia | ? |
| Southern Europe | ? |
| Latin America & the Caribbean | ? |
| South America | ? |
| Australia and New Zealand | ? |
| Melanesia | ? |
| Micronesia | ? |
| Polynesia | ? |
| South-central Asia | ? |
| Channel Islands | ? |
| Oceania | ? |
Use Country as row names instead.
| inf.mort | life.exp.both | life.exp.female | life.exp.male | maternal.mortality.ratio | pop.increase | tot.fertil.rate | Continent |
|---|---|---|---|---|---|---|---|
| 8.3 | 74.3 | 77.8 | 70.8 | 11 | -0.6 | 1.5 | Europe |
| 45.0 | 66.0 | 68.3 | 63.7 | 178 | 0.9 | 2.3 | Asia |
| 77.9 | 56.1 | 58.0 | 54.2 | 712 | 3.0 | 6.0 | Africa |
| 3.6 | 72.1 | 77.7 | 66.5 | 4 | 0.0 | 1.6 | Europe |
| 29.9 | 67.6 | 69.6 | 65.5 | 161 | 1.6 | 2.7 | Asia |
| 27.7 | 75.3 | 76.5 | 74.1 | 140 | 2.0 | 3.0 | Africa |
Now the data is ready to be processed.
From the plot above we can conclude that:
the correlations between life expectancy at birth for males, females, and both sexes are very high, so we will keep only the life expectancy for both sexes
all variables are relatively strongly correlated with each other, except pop.increase
pop.increase has the weakest correlation with the other variables
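The kind of check behind these conclusions can be sketched with `cor()`. The data below is simulated so that the three life expectancy columns are nearly redundant; it is an assumption for illustration and not the document's data.

```r
set.seed(1)
n <- 50

# Simulated life expectancies: female and male are the combined value
# shifted up or down, plus a little noise.
life.exp.both   <- rnorm(n, mean = 70, sd = 8)
life.exp.female <- life.exp.both + 2.5 + rnorm(n, sd = 0.5)
life.exp.male   <- life.exp.both - 2.5 + rnorm(n, sd = 0.5)
toy <- data.frame(life.exp.both, life.exp.female, life.exp.male)

# Pearson correlation matrix of the numeric columns.
cm <- cor(toy)
print(round(cm, 3))

# List variable pairs whose absolute correlation exceeds 0.95.
# upper.tri() keeps each pair once and skips the diagonal.
high <- which(abs(cm) > 0.95 & upper.tri(cm), arr.ind = TRUE)
print(high)
```

When several columns are this strongly correlated, keeping a single representative (here the combined life expectancy) loses little information.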
Africa dominates the low life expectancy area, while European countries are mostly in the high life expectancy area. The rest are spread from the middle to the high end.
Countries with high infant mortality usually have lower life expectancy, since infant deaths pull life expectancy down. Africa dominates this area and Europe is on the opposite side.
The higher the fertility rate, the lower the life expectancy. Africa dominates this area and Europe is on the opposite side.
Africa dominates the area where the total fertility rate is high, which means that African countries are “productive”.
It makes sense that countries with a low fertility rate also have low infant mortality.
Countries with high total fertility usually have low life expectancy.
Countries with a high fertility rate tend to have a high maternal mortality ratio, and this area is again dominated by African countries.
Europe has low infant mortality but also low population increase, which seems reasonable.
Most African countries and some Asian countries have both high population increase and high infant mortality. That is not a good sign: birth rates are high, but many infants do not survive.
Some Asian countries keep infant mortality low while their population still increases strongly. These are the “oil well” countries of the world.
The higher the total fertility rate, the higher the population increase.
The data must be scaled before clustering, because k-means relies on distances and the indicators are measured on very different scales.
## inf.mort life.exp.both maternal.mortality.ratio pop.increase
## Min. : 1.6 Min. :49.40 Min. : 3.0 Min. :-2.300
## 1st Qu.: 6.9 1st Qu.:65.80 1st Qu.: 16.5 1st Qu.: 0.400
## Median :17.1 Median :73.00 Median : 76.0 Median : 1.300
## Mean :25.7 Mean :71.27 Mean :162.2 Mean : 1.401
## 3rd Qu.:42.1 3rd Qu.:76.90 3rd Qu.:187.5 3rd Qu.: 2.300
## Max. :94.4 Max. :83.40 Max. :882.0 Max. : 6.600
## tot.fertil.rate Continent
## Min. :1.200 Africa :57
## 1st Qu.:1.800 Americas:42
## Median :2.400 Asia :50
## Mean :2.853 Europe :40
## 3rd Qu.:3.750 Oceania :14
## Max. :7.400
## inf.mort life.exp.both maternal.mortality.ratio pop.increase
## Min. :-1.0369 Min. :-2.6949 Min. :-0.7804 Min. :-2.7139
## 1st Qu.:-0.8089 1st Qu.:-0.6741 1st Qu.:-0.7142 1st Qu.:-0.7343
## Median :-0.3700 Median : 0.2131 Median :-0.4225 Median :-0.0744
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7055 3rd Qu.: 0.6937 3rd Qu.: 0.1243 3rd Qu.: 0.6588
## Max. : 2.9556 Max. : 1.4947 Max. : 3.5298 Max. : 3.8116
## tot.fertil.rate
## Min. :-1.1726
## 1st Qu.:-0.7470
## Median :-0.3215
## Mean : 0.0000
## 3rd Qu.: 0.6359
## Max. : 3.2245
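The two summaries above (before and after scaling) correspond to a standardization step, which can be sketched with `scale()`. The toy values below are taken loosely from the summary quantiles and are only an illustration.

```r
# Two indicators on very different scales (illustrative values).
toy <- data.frame(
  inf.mort                 = c(1.6, 6.9, 17.1, 42.1, 94.4),
  maternal.mortality.ratio = c(3, 16.5, 76, 187.5, 882)
)

# scale() centers each column to mean 0 and divides by its standard deviation.
scaled <- scale(toy)

print(summary(scaled))
print(colMeans(scaled))       # approximately 0 for every column
print(apply(scaled, 2, sd))   # 1 for every column
```

After scaling, no single indicator dominates the Euclidean distances just because of its units.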
The elbow method shows that the optimum K value is 2. But I think we should try 3 as well, since 2 clusters will not give us much information.
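A minimal sketch of the elbow method, assuming the usual criterion of total within-cluster sum of squares. The data is simulated with two well-separated groups, so the elbow is expected at K = 2; none of this is the document's own code.

```r
set.seed(42)

# Two simulated groups of 30 points each in two dimensions.
toy <- rbind(
  matrix(rnorm(60, mean = -2, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean =  2, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for K = 1..6.
ks  <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

print(data.frame(K = ks, wss = round(wss, 2)))

# Uncomment to draw the elbow curve:
# plot(ks, wss, type = "b",
#      xlab = "Number of clusters K",
#      ylab = "Total within-cluster sum of squares")
```

The elbow is the K after which the curve flattens. It only measures compactness, so trying a larger K for interpretability, as done here with 3, is a reasonable judgment call.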
k-means modeling
cluster distribution
| Cluster | Freq |
|---|---|
| 1 | 153 |
| 2 | 50 |
| Cluster | Freq |
|---|---|
| 1 | 42 |
| 2 | 104 |
| 3 | 57 |
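Frequency tables like the two above can be produced from a fitted k-means model. The sketch below uses simulated data with three well-separated groups; the object names `km2` and `km3` are assumptions for the example.

```r
set.seed(123)

# Three simulated groups of 20 points each.
toy <- rbind(
  matrix(rnorm(40, mean = -4, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean =  0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean =  4, sd = 0.3), ncol = 2)
)

# Fit k-means with 2 and 3 clusters; nstart repeats the random
# initialization and keeps the best solution.
km2 <- kmeans(toy, centers = 2, nstart = 25)
km3 <- kmeans(toy, centers = 3, nstart = 25)

# Cluster sizes.
print(table(Cluster = km2$cluster))
print(table(Cluster = km3$cluster))

# Side-by-side membership, as in a per-country comparison table.
membership <- data.frame(clust2 = km2$cluster, clust3 = km3$cluster)
print(head(membership))
```

Cluster numbers from k-means are arbitrary labels: they can change between runs unless the random seed is fixed, so the numbering in a table and in a later plot or description need not match.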
| Country | 2-Clusters | 3-Clusters |
|---|---|---|
| Bulgaria | 1 | 2 |
| Myanmar | 1 | 3 |
| Burundi | 2 | 1 |
| Belarus | 1 | 2 |
| Cambodia | 1 | 3 |
| Algeria | 1 | 3 |
| Cameroon | 2 | 1 |
| Canada | 1 | 2 |
| Cabo Verde | 1 | 3 |
| Central African Republic | 2 | 1 |
| Sri Lanka | 1 | 2 |
| Chad | 2 | 1 |
| Chile | 1 | 2 |
| China | 1 | 2 |
| Colombia | 1 | 2 |
| Comoros | 2 | 1 |
| Mayotte | 1 | 3 |
| Congo | 2 | 1 |
| Dem. Rep. of the Congo | 2 | 1 |
| Costa Rica | 1 | 2 |
| Croatia | 1 | 2 |
| Cuba | 1 | 2 |
| Cyprus | 1 | 2 |
| Czechia | 1 | 2 |
| Benin | 2 | 1 |
| Denmark | 1 | 2 |
| Dominican Republic | 1 | 3 |
| Ecuador | 1 | 3 |
| El Salvador | 1 | 2 |
| Equatorial Guinea | 2 | 1 |
| Ethiopia | 2 | 1 |
| Eritrea | 2 | 1 |
| Estonia | 1 | 2 |
| Faroe Islands | 1 | 2 |
| Angola | 2 | 1 |
| Fiji | 1 | 2 |
| Finland | 1 | 2 |
| France | 1 | 2 |
| French Guiana | 1 | 3 |
| French Polynesia | 1 | 2 |
| Djibouti | 2 | 3 |
| Gabon | 2 | 3 |
| Georgia | 1 | 2 |
| Gambia | 2 | 1 |
| State of Palestine | 1 | 3 |
| Germany | 1 | 2 |
| Antigua and Barbuda | 1 | 2 |
| Ghana | 2 | 1 |
| Kiribati | 1 | 3 |
| Greece | 1 | 2 |
| Greenland | 1 | 2 |
| Grenada | 1 | 2 |
| Azerbaijan | 1 | 3 |
| Guadeloupe | 1 | 2 |
| Guam | 1 | 2 |
| Argentina | 1 | 2 |
| Guatemala | 1 | 3 |
| Guinea | 2 | 1 |
| Guyana | 1 | 3 |
| Haiti | 2 | 3 |
| Honduras | 1 | 3 |
| China, Hong Kong SAR | 1 | 2 |
| Hungary | 1 | 2 |
| Iceland | 1 | 2 |
| India | 1 | 3 |
| Australia | 1 | 2 |
| Indonesia | 1 | 3 |
| Iran (Islamic Republic of) | 1 | 2 |
| Iraq | 1 | 3 |
| Ireland | 1 | 2 |
| Israel | 1 | 2 |
| Italy | 1 | 2 |
| Côte d’Ivoire | 2 | 1 |
| Jamaica | 1 | 2 |
| Japan | 1 | 2 |
| Kazakhstan | 1 | 3 |
| Afghanistan | 2 | 1 |
| Austria | 1 | 2 |
| Jordan | 1 | 3 |
| Kenya | 2 | 1 |
| Dem. People’s Rep. Korea | 1 | 2 |
| Republic of Korea | 1 | 2 |
| Kuwait | 1 | 3 |
| Kyrgyzstan | 1 | 3 |
| Lao People’s Dem. Rep. | 1 | 3 |
| Lebanon | 1 | 3 |
| Lesotho | 2 | 1 |
| Latvia | 1 | 2 |
| Liberia | 2 | 1 |
| Libya | 1 | 2 |
| Bahamas | 1 | 2 |
| Lithuania | 1 | 2 |
| Luxembourg | 1 | 2 |
| China, Macao SAR | 1 | 2 |
| Madagascar | 2 | 1 |
| Malawi | 2 | 1 |
| Malaysia | 1 | 2 |
| Maldives | 1 | 3 |
| Mali | 2 | 1 |
| Malta | 1 | 2 |
| Martinique | 1 | 2 |
| Mauritania | 2 | 1 |
| Bahrain | 1 | 2 |
| Mauritius | 1 | 2 |
| Mexico | 1 | 2 |
| Mongolia | 1 | 3 |
| Republic of Moldova | 1 | 2 |
| Montenegro | 1 | 2 |
| Bangladesh | 1 | 3 |
| Morocco | 1 | 3 |
| Mozambique | 2 | 1 |
| Armenia | 1 | 2 |
| Oman | 1 | 3 |
| Namibia | 2 | 3 |
| Barbados | 1 | 2 |
| Nepal | 1 | 3 |
| Netherlands | 1 | 2 |
| Curaçao | 1 | 2 |
| Aruba | 1 | 2 |
| New Caledonia | 1 | 2 |
| Vanuatu | 1 | 3 |
| New Zealand | 1 | 2 |
| Nicaragua | 1 | 2 |
| Belgium | 1 | 2 |
| Niger | 2 | 1 |
| Nigeria | 2 | 1 |
| Norway | 1 | 2 |
| Micronesia (Fed. States of) | 1 | 3 |
| Palau | 1 | 2 |
| Pakistan | 2 | 3 |
| Panama | 1 | 2 |
| Papua New Guinea | 2 | 3 |
| Bermuda | 1 | 2 |
| Paraguay | 1 | 3 |
| Peru | 1 | 2 |
| Philippines | 1 | 3 |
| Poland | 1 | 2 |
| Portugal | 1 | 2 |
| Guinea-Bissau | 2 | 1 |
| Timor-Leste | 2 | 1 |
| Puerto Rico | 1 | 2 |
| Qatar | 1 | 3 |
| Réunion | 1 | 2 |
| Bhutan | 1 | 3 |
| Romania | 1 | 2 |
| Russian Federation | 1 | 2 |
| Rwanda | 2 | 3 |
| Saint Lucia | 1 | 2 |
| Saint Vincent & Grenadines | 1 | 2 |
| Sao Tome and Principe | 2 | 3 |
| Bolivia (Plurin. State of) | 1 | 3 |
| Saudi Arabia | 1 | 3 |
| Senegal | 2 | 1 |
| Serbia | 1 | 2 |
| Seychelles | 1 | 2 |
| Sierra Leone | 2 | 1 |
| Bosnia and Herzegovina | 1 | 2 |
| Singapore | 1 | 2 |
| Slovakia | 1 | 2 |
| Viet Nam | 1 | 2 |
| Slovenia | 1 | 2 |
| Somalia | 2 | 1 |
| South Africa | 1 | 3 |
| Zimbabwe | 2 | 1 |
| Botswana | 1 | 3 |
| Spain | 1 | 2 |
| South Sudan | 2 | 1 |
| Sudan | 2 | 1 |
| Western Sahara | 1 | 3 |
| Suriname | 1 | 3 |
| Swaziland | 2 | 1 |
| Sweden | 1 | 2 |
| Switzerland | 1 | 2 |
| Brazil | 1 | 2 |
| Syrian Arab Republic | 1 | 2 |
| Tajikistan | 1 | 3 |
| Thailand | 1 | 2 |
| Togo | 2 | 1 |
| Tonga | 1 | 3 |
| Trinidad and Tobago | 1 | 2 |
| United Arab Emirates | 1 | 2 |
| Tunisia | 1 | 2 |
| Turkey | 1 | 2 |
| Turkmenistan | 1 | 3 |
| Albania | 1 | 2 |
| Uganda | 2 | 1 |
| Ukraine | 1 | 2 |
| TFYR of Macedonia | 1 | 2 |
| Egypt | 1 | 3 |
| United Kingdom | 1 | 2 |
| United Rep. of Tanzania | 2 | 1 |
| Belize | 1 | 3 |
| United States of America | 1 | 2 |
| United States Virgin Islands | 1 | 2 |
| Burkina Faso | 2 | 1 |
| Uruguay | 1 | 2 |
| Uzbekistan | 1 | 3 |
| Venezuela (Boliv. Rep. of) | 1 | 2 |
| Samoa | 1 | 3 |
| Yemen | 2 | 1 |
| Zambia | 2 | 1 |
| Solomon Islands | 1 | 3 |
| Brunei Darussalam | 1 | 2 |
| Dimension | eigenvalue | percentage of variance | cumulative percentage of variance |
|---|---|---|---|
| comp 1 | 3.9451216 | 78.902432 | 78.90243 |
| comp 2 | 0.6727678 | 13.455355 | 92.35779 |
| comp 3 | 0.1954976 | 3.909951 | 96.26774 |
| comp 4 | 0.1354516 | 2.709031 | 98.97677 |
| comp 5 | 0.0511615 | 1.023230 | 100.00000 |
Dimension 1 holds about 79% of the information (variance) and dimension 2 about 13%. Together they hold around 92% of the information.
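An eigenvalue table of this shape can be derived from any PCA fit. The sketch below uses base R's `prcomp()` on simulated data; the layout of the table above resembles the output of a dedicated PCA package, but that is an assumption, and the numbers here are unrelated to the document's results.

```r
set.seed(7)
n <- 100

# Simulated data: a and b share a latent factor, c is independent noise.
latent <- rnorm(n)
toy <- data.frame(
  a = latent + rnorm(n, sd = 0.3),
  b = -latent + rnorm(n, sd = 0.3),
  c = rnorm(n)
)

# PCA on standardized variables.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components.
eig <- pca$sdev^2
pct <- 100 * eig / sum(eig)

res <- data.frame(
  eigenvalue = eig,
  percentage = pct,
  cumulative = cumsum(pct),
  row.names  = paste("comp", seq_along(eig))
)
print(res)
```

With standardized variables the eigenvalues sum to the number of variables, which is why five indicators give eigenvalues that add up to 5 in the table above.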
When we divide the data into 2 clusters, we can conclude that cluster 1 contains:
countries with low life expectancy for both males and females
countries with a high fertility rate
countries with high population increase
and African countries dominate this cluster.
This cluster suggests that its countries may be less healthy, since they have low life expectancy. These countries will have more young people in the future, because the fertility rate is high and the population is growing rapidly.
Cluster 2 contains:
countries with high life expectancy for both males and females
countries with a low fertility rate
countries with low infant mortality
countries with a low maternal mortality ratio
and all European countries are in cluster 2.
This cluster suggests that its countries will tend to have fewer people of productive age in the future, since the fertility rate is low and the population is barely growing. With high life expectancy, the population of these countries will eventually be dominated by older people.
When we divide the data into 3 clusters, we can conclude that cluster 1 contains:
countries with low life expectancy for both males and females
countries with a high fertility rate
countries with high population increase
African countries still dominate this cluster
This cluster is not very different from cluster 1 in the previous case.
Cluster 2 contains:
the countries in the middle, whose values are near the average.
there are some outliers in this cluster: countries with high population growth and low infant mortality, the “oil well” countries mentioned earlier.
Cluster 3 contains:
countries with high life expectancy for both males and females
countries with a low fertility rate
countries with low population increase
countries with a low maternal mortality ratio
This cluster suggests that its countries are more likely to have fewer young people than countries in the other clusters, since they have low population increase and a low fertility rate. These countries should be more “productive”.
Next, we are going to look at an animated plot of each country in each cluster from 2005 to 2015. We expect to see some countries change cluster over time.
## 'data.frame': 541 obs. of 7 variables:
## $ PC1 : num -4.8 -3.38 -2.48 1.48 1.8 ...
## $ PC2 : num -0.674 -0.027 -0.489 0.799 1.122 ...
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Albania" ...
## $ clust2 : Factor w/ 2 levels "1","2": 2 2 2 1 1 1 1 1 1 2 ...
## $ clust3 : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3 1 ...
## $ year : num 2005 2010 2015 2005 2010 ...
## $ Continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 4 4 4 1 1 1 1 ...
Some countries move from cluster 1 to cluster 2.
Some countries change their cluster.
The cluster positions are flipped because the “var” plot is different.
Each arrow is rotated by 180 degrees, so the information gained from the animated plot is still valid.
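One possible reading of the flip described above is the sign indeterminacy of principal components: negating a component's scores together with its loadings describes exactly the same data. The sketch below illustrates that property on simulated data; whether this is the actual cause of the flip in the animated plot is an assumption.

```r
set.seed(3)

# Simulated data with three correlated variables.
toy <- data.frame(x = rnorm(30), y = rnorm(30))
toy$z <- toy$x + toy$y + rnorm(30, sd = 0.2)

pca      <- prcomp(toy, scale. = TRUE)
scores   <- pca$x          # coordinates of the observations
loadings <- pca$rotation   # directions of the variable arrows

# Flip every axis: observations and arrows both rotate by 180 degrees.
flipped_scores   <- -scores
flipped_loadings <- -loadings

# Both versions reconstruct the same standardized data.
recon         <- scores %*% t(loadings)
recon_flipped <- flipped_scores %*% t(flipped_loadings)
print(all.equal(recon, recon_flipped))

# Distances between observations are unchanged, so groupings look the same.
same_dist <- all.equal(as.vector(dist(scores)),
                       as.vector(dist(flipped_scores)))
print(same_dist)
```

Because points and arrows flip together, the relative position of each country with respect to each variable arrow is preserved, which is why the interpretation stays valid.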
Based on the previous analysis, I recommend using 3 clusters because it gives us more information. The 2-cluster solution is too general, while the 3-cluster solution is more specific.
Using 2 clusters only tells us that there are 2 groups of countries: the first has high life expectancy, a low fertility rate, and low population increase, and the other is the opposite.
But when we use 3 clusters, we can also see the middle cluster between the two extremes.